DETAILED ACTION



Claim Rejections - 35 USC § 103 
1. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 



2. Claims 1,2,4,6,9-16,18,20,23-37 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over van den Akker (US 6,415,250) in view of Bracewell et al. (US PGPub 
2006/0041685). 

As to claims 1 and 15, van den Akker discloses a system for identifying 
language attributes through probabilistic analysis (see abstract lines 1-3; column 
3 paragraph 1; column 7 paragraph 4). Van den Akker also discloses a storage 
storing a set of language classes, which identify a language and a character set 
encoding (see column 3 paragraph 2; column 9 paragraph 2; column 11 lines 34-39),
and a plurality of training documents (see column 9 lines 5-8, 42-46; column
10 lines 5-10). Van den Akker discloses a text modeler evaluating byte
occurrence within each training document and, for each language class,
calculating a probability for the byte occurrences conditioned on the occurrence 
of the language class (see column 3 lines 15-20, 25-35; column 9 lines 15-20). 
Van den Akker does not specifically disclose having an attribute modeler that 
evaluates occurrences of one or more document properties within each training 
document and, for each language class, calculating a probability for the 
document properties set conditioned on the occurrence of the language class. 
Bracewell teaches that document properties like HTTP header information can 
be used in order to identify the language of a document or search query on the 
internet (see [0014]). It would have been obvious to have used the attribute 
properties, such as the HTTP header information, as disclosed in Bracewell in a 
language identification model taught by van den Akker because using text and 
document property will result in an efficient form of identifying the language of a 
document or search query (see [0014]). 

As to claims 2 and 16, van den Akker discloses a training engine 
calculating an overall probability for each language class by evaluating the 
probability for the document property set and the probability for the byte 
occurrences (see column 3 lines 35-45, 64-67; column 4 lines 50-55; column 10 
lines 33-40).

As to claims 4 and 18, van den Akker does not disclose specifically that
the document properties comprise at least one of top level domain, HTTP content
character set encoding and language header parameters, and HTML content 
character set encoding and language metatags. Bracewell teaches that 
document properties like HTTP header information can be used in order to 
identify the language of a document or search query on the Internet (see [0014]). 
It would have been obvious to have used the attribute properties, such as the 
HTTP header information, as disclosed in Bracewell in a language identification 
model taught by van den Akker because using text and document property will 
result in an efficient form of identifying the language of a document or search 
query (see [0014]). 



As to claims 6 and 20, van den Akker discloses a counting module 
counting byte co-occurrences within each training document, and determining the 
probability for the byte occurrences based on the byte co-occurrences (see 
column 3 lines 15-20, 50-55; column 12 lines 50-55). 

As to claims 9 and 23, van den Akker discloses a training engine 
performing iterative training by providing the probability for the document 
properties set and the probability for the byte occurrences set respectively to the
evaluation of byte occurrences and assignment of the set of language classes
(see column 3 lines 35-45, 64-67; column 4 lines 50-55; column 10 lines 33-40). 



As to claims 10 and 24, van den Akker discloses a back off module 
evaluating less frequently occurring document properties by calculating a 
probability for each less frequently occurring document property conditioned on 
the occurrence of the language class (see column 10 lines 45-65).

As to claims 11 and 25, van den Akker discloses a plurality of unlabeled
documents (see column 7 lines 50-55) and a classifier classifying one or more
unlabeled documents by at least one language class (see column 5 lines 36-44), 
comprising evaluating occurrences of one or more document properties within 
the unlabeled document, evaluating byte occurrences within the unlabeled 
document, and assigning a probability for the document properties set and the 
byte occurrences for the unlabeled document conditioned on the occurrence of 
the language class (see column 7 lines 55-67).

As to claims 12 and 26, van den Akker discloses a class selector selecting 
at least one language class having a substantially highest probability (see 
column 5 paragraph 2 lines 23-26).

As to claims 13 and 27, van den Akker discloses a probability threshold
and a pruner pruning at least one language class falling below the probability 
threshold (see column 13 paragraph 2 lines 17-25). 
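
For illustration only, the class selection and threshold pruning discussed above can be sketched as follows. This is a minimal sketch and is not code from van den Akker or from the application; the function names, class labels, and probability values are hypothetical.

```python
def prune_classes(class_probs, threshold):
    """Drop language classes whose probability falls below the threshold."""
    return {cls: p for cls, p in class_probs.items() if p >= threshold}


def select_class(class_probs):
    """Return the language class with the highest probability, or None if empty."""
    if not class_probs:
        return None
    return max(class_probs, key=class_probs.get)


# Hypothetical per-class probabilities for one unlabeled document.
probs = {"en/ISO-8859-1": 0.70, "fr/ISO-8859-1": 0.25, "ja/Shift_JIS": 0.05}
kept = prune_classes(probs, threshold=0.10)   # the Japanese class is pruned
best = select_class(kept)                     # the English class is selected
```

Pruning before selection does not change which class is selected here; it only limits the candidate set that would be carried forward.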

As to claims 14 and 28, van den Akker discloses a training document
comprising one of a Web page and a news message (see column 4 lines 15-17;
column 5 lines 27-37; column 7 lines 56-62).

As to claims 29 and 36, van den Akker discloses a computer-readable 
storage medium holding code for performing the method of identifying language 
attributes through probabilistic analysis (see column 6 lines 39-60). 

As to claims 30 and 33, van den Akker discloses a system for identifying 
documents by language using probabilistic analysis of language attributes (see 
abstract lines 1-3; column 3 paragraph 1; column 7 paragraph 4), comprising: a 
set of language classes, each language class comprising a language and a 
character set encoding name (see column 3 paragraph 2; column 9 paragraph 2; 
column 11 lines 34-39); a training corpora comprising a plurality of training 
documents (see column 9 lines 5-8, 42-46; column 10 lines 5-10); and a text 
modeler training a text model by evaluating co-occurrences of a plurality of bytes 
within each training document and, for each language class, calculating a
probability for the byte co-occurrences conditioned on the occurrence of each
language class (see column 3 lines 15-20, 25-35; column 9 lines 15-20). Van den
Akker does not specifically disclose an attribute modeler training an attribute 
model by evaluating a top level domain and character set encoding associated 
with each training document and, for each language class, calculating a 
probability for each such top level domain and character set encoding 
conditioned on the occurrence of each language class. Bracewell teaches
that document properties like HTTP header information can be used in order to 
identify the language of a document or search query on the internet (see [0014]). 
It would have been obvious to have used the attribute properties, such as the 
HTTP header information, as disclosed in Bracewell in a language identification 
model taught by van den Akker because using text and document property will 
result in an efficient form of identifying the language of a document or search 
query (see [0014]). 

As to claims 31 and 34, van den Akker discloses a training engine 
calculating an overall probability for each language class by evaluating the 
probability for the top level domain and character set encoding based on the 
attribute model and the probability for the byte occurrences based on the text 
model (see column 3 lines 35-45, 64-67; column 4 lines 50-55; column 10 lines 
33-40).

As to claim 32, van den Akker discloses a classifier classifying one or
more documents (see column 5 lines 36-44), comprising: a text evaluator 
evaluating byte occurrences in each document and applying the text model to the 
evaluated byte occurrences (see column 3 lines 15-20, 25-35; column 9 lines 15-20);
and an assignment module assigning at least one language class based on
the applications of the attribute model and the text model (see column 5 lines 24-26).
Van den Akker does not specifically disclose an attribute evaluator
evaluating a top level domain and character set encoding in each document and 
applying the attribute model to the evaluated top level domain and character set 
encoding. Bracewell teaches that document properties like HTTP header 
information can be used in order to identify the language of a document or 
search query on the internet (see [0014]). It would have been obvious to have 
used the attribute properties, such as the HTTP header information, as disclosed 
in Bracewell in a language identification model taught by van den Akker because 
using text and document property will result in an efficient form of identifying the 
language of a document or search query (see [0014]). 

As to claim 35, van den Akker discloses classifying one or more 
documents (see column 5 lines 36-44), comprising: evaluating byte occurrences 
in each document and applying the text model to the evaluated byte occurrences 
(see column 3 lines 15-20, 25-35; column 9 lines 15-20); and assigning at least
one language class based on the applications of the attribute model and the text
model (see column 5 lines 24-26). Van den Akker does not specifically disclose 
evaluating a top level domain and character set encoding in each document and 
applying the attribute model to the evaluated top level domain and character set 
encoding. Bracewell teaches that document properties like HTTP header 
information can be used in order to identify the language of a document or 
search query on the internet (see [0014]). It would have been obvious to have 
used the attribute properties, such as the HTTP header information, as disclosed 
in Bracewell in a language identification model taught by van den Akker because 
using text and document property will result in an efficient form of identifying the 
language of a document or search query (see [0014]). 



As to claim 37, van den Akker discloses an apparatus for identifying 
documents by language using probabilistic analysis of language attributes (see 
abstract lines 1-3; column 3 paragraph 1; column 7 paragraph 4), comprising: 
means for defining a set of language classes, each language class comprising a 
language name and a character set encoding name (see column 3 paragraph 2; 
column 9 paragraph 2; column 11 lines 34-39); means for training a text model 
by evaluating co-occurrences of a plurality of bytes within each training 
document, for each language class, calculating a probability for the byte
co-occurrence conditioned on the occurrence of the language class based on the
attribute model (see column 3 lines 15-20, 25-35; column 9 lines 15-20). Van den
Akker does not disclose specifically means for training an attribute model by 
assigning at least one top level domain and character set encoding pairing to at 
least one language class for each of a plurality of training documents and 
calculating a probability for each such top level domain and character set 
encoding pairing conditioned on the occurrence of the assigned language class. 
Bracewell teaches that document properties like HTTP header information can
be used in order to identify the language of a document or search query on the 
internet (see [0014]). It would have been obvious to have used the attribute 
properties, such as the HTTP header information, as disclosed in Bracewell in a 
language identification model taught by van den Akker because using text and 
document property will result in an efficient form of identifying the language of a 
document or search query (see [0014]). 



3. Claims 3, 5, 17, and 19 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over van den Akker (US 6,415,250) in view of Bracewell et al. (US PGPub 
2006/0041685) as applied to claims 1,2,4,6,9-16,18,20,23-37 above, and in further view
of Elworthy (US 6,125,362).

As to claims 3 and 17, van den Akker and Bracewell do not specifically
disclose an assignment module assigning the overall probability for each 
language class in accordance with the formula: arg max 
P(text|cls)*P(props|cls)*P(cls). Elworthy teaches the use of probabilistic analysis 
for determining the language of a text or document (see column 1 lines 37-40). 
Elworthy further teaches that in order to classify documents or text according to a 
certain element the following Bayesian probabilistic formula can be used: p(l|t) = 
(P(t|l)*p(l))/p(t) (see column 4 lines 25-35). If the denominator is passed to the 
left side the resultant equation is: p(l|t) * p(t) = P(t|l)*p(l), where p(l|t) is the 
probability of the classification given the element and p(t) is the probability of the 
element (see column 4 lines 35-45). It would have been obvious to have used 
p(l|t)*p(t) as disclosed in Elworthy for the probabilistic analysis in van den Akker 
as modified, where p(l|t) = P(text|cls)*P(props|cls), where t is the language class
and l is the text or the attribute property and p(l|t)*p(t) would be the probability of
the language class given the text and the attribute. It would have been obvious 
to have used both prior arts because using the method described above would 
yield an accurate form of identifying the language model (see column 4 lines
25-45).
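
The formula recited in claims 3 and 17, arg max P(text|cls)*P(props|cls)*P(cls), can be illustrated with a short sketch. The sketch is not taken from any cited reference; the function name, class labels, and all probability values are hypothetical. Logarithms are summed rather than multiplying raw probabilities, an implementation choice assumed here to avoid numerical underflow; it does not change which class attains the maximum.

```python
import math


def select_language_class(p_text, p_props, p_cls):
    """Return (best_class, scores) for arg max P(text|cls)*P(props|cls)*P(cls).

    Scores are log-probabilities; all inputs are assumed strictly positive,
    since math.log raises ValueError for zero.
    """
    scores = {
        cls: math.log(p_text[cls]) + math.log(p_props[cls]) + math.log(p_cls[cls])
        for cls in p_cls
    }
    return max(scores, key=scores.get), scores


# Hypothetical values for one document and two candidate language classes.
P_TEXT = {"fr/ISO-8859-1": 0.020, "en/ISO-8859-1": 0.005}   # P(text|cls)
P_PROPS = {"fr/ISO-8859-1": 0.60, "en/ISO-8859-1": 0.10}    # P(props|cls)
P_CLS = {"fr/ISO-8859-1": 0.30, "en/ISO-8859-1": 0.70}      # P(cls)

best_cls, log_scores = select_language_class(P_TEXT, P_PROPS, P_CLS)
```

With these assumed values the French class is selected even though the English class has the larger prior, because the text and property terms outweigh it.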

As to claims 5 and 19, van den Akker and Bracewell do not specifically 
disclose an assignment module assigning the probability for the document 
properties set in accordance with the formula: P(tld,enc|cls)*P(cls). Elworthy
teaches the use of probabilistic analysis for determining the language of a text or
document (see column 1 lines 37-40). Elworthy further teaches that in order to 
classify documents or text according to a certain element the following Bayesian 
probabilistic formula can be used: p(l|t) = (P(t|l)*p(l))/p(t) (see column 4 lines 25-35).
If the denominator is passed to the left side the resultant equation is: p(l|t)
* p(t) = P(t|l)*p(l), where p(l|t) is the probability of the classification given the 
element and p(t) is the probability of the element (see column 4 lines 35-45). It 
would have been obvious to have used p(l|t)*p(t) as disclosed in Elworthy for the 
probabilistic analysis in van den Akker as modified, where p(l|t) =
P(tld,enc|cls)*P(cls), where t is the language class and l is the text or the
attribute property and p(l|t)*p(t) would be the probability of the language class 
given the text and the attribute. It would have been obvious to have used both 
prior arts because using the method described above would yield an accurate 
form of identifying the language model (see column 4 lines 25-45). 
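
As an illustration of the quantity P(tld,enc|cls)*P(cls) recited in claims 5 and 19, the sketch below estimates both factors by simple counting over labeled training documents. It is a sketch under stated assumptions and not the method of any cited reference: the function names, the tuple layout (top level domain, encoding, class label), and the sample data are all hypothetical, and no smoothing or back-off is applied.

```python
from collections import Counter, defaultdict


def train_attribute_model(labeled_docs):
    """Estimate P(tld,enc|cls) and P(cls) from (tld, enc, cls) tuples by counting."""
    pair_counts = defaultdict(Counter)
    class_counts = Counter()
    for tld, enc, cls in labeled_docs:
        pair_counts[cls][(tld, enc)] += 1
        class_counts[cls] += 1
    total = sum(class_counts.values())
    p_cls = {cls: n / total for cls, n in class_counts.items()}
    p_pair = {
        cls: {pair: n / class_counts[cls] for pair, n in pairs.items()}
        for cls, pairs in pair_counts.items()
    }
    return p_pair, p_cls


def attribute_score(p_pair, p_cls, tld, enc, cls):
    """Return P(tld,enc|cls) * P(cls); unseen pairs or classes score 0.0."""
    return p_pair.get(cls, {}).get((tld, enc), 0.0) * p_cls.get(cls, 0.0)


# Hypothetical training data: (top level domain, encoding, language class).
DOCS = [
    ("fr", "ISO-8859-1", "french"),
    ("fr", "UTF-8", "french"),
    ("com", "UTF-8", "english"),
    ("com", "UTF-8", "english"),
]
P_PAIR, P_CLASS = train_attribute_model(DOCS)
score_fr = attribute_score(P_PAIR, P_CLASS, "fr", "ISO-8859-1", "french")
score_en = attribute_score(P_PAIR, P_CLASS, "com", "UTF-8", "english")
```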



4. Claims 7 and 21 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
van den Akker (US 6,415,250) in view of Bracewell et al. (US PGPub 2006/0041685) as 
applied to claims 1,2,4,6,9-16,18,20,23-37 above, and in further view of de Campos (US 
6,272,456). 



Application/Control Number: 10/634,616 Page 13 

Art Unit: 2609 

As to claims 7 and 21, van den Akker and Bracewell do not specifically
disclose using trigrams. De Campos teaches that the byte co-occurrences
comprise a set of trigrams (see column 1 lines 59-67; column 2 lines 50-54, 59-64;
column 6 lines 53-60), further comprising a probability module calculating a
probability of each trigram as the number of occurrences of the trigram divided by 
the total number of trigram occurrences in each of the training documents for 
each language class (see column 18 lines 64-67 and column 19 lines 1-6). It 
would have been obvious to use the trigram method disclosed in de Campos for 
the byte co-occurrences in the text model in van den Akker as modified because
the trigram method would allow the text model to break down the unlabeled text 
and identify the language (see column 18 lines 64-67 and column 19 lines 1-6). 
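
The trigram probability described above (occurrences of a trigram divided by the total number of trigram occurrences in the training documents of a language class) can be sketched as follows. This is an illustrative sketch only, not code from de Campos; the function names and the sample input are hypothetical, and overlapping byte trigrams are assumed.

```python
from collections import Counter


def byte_trigrams(data: bytes):
    """Return every overlapping 3-byte sequence in the input."""
    return [data[i:i + 3] for i in range(len(data) - 2)]


def trigram_probabilities(training_docs):
    """Relative frequency of each trigram over one language class's documents."""
    counts = Counter()
    for doc in training_docs:
        counts.update(byte_trigrams(doc))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tri: n / total for tri, n in counts.items()}


# Hypothetical single training document for one language class.
trigram_probs = trigram_probabilities([b"abcabc"])
```

For the sample input the trigrams are abc, bca, cab, abc, so abc receives probability 2/4 and the other two receive 1/4 each.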



5. Claims 8 and 22 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over van den Akker (US 6,415,250) in view of Bracewell et al. (US PGPub
2006/0041685) as applied to claims 1,2,4,6,9-16,18,20,23-37 above, and in further view 
of de Campos (US 6,272,456) and Elworthy (US 6,125,362). 

As to claims 8 and 22, van den Akker, Bracewell and de Campos do not
specifically disclose an assignment module assigning the overall probability for 
each language class in accordance with the formula: P(text|cls). Elworthy 
teaches the use of probabilistic analysis for determining the language of a text or
document (see column 1 lines 37-40). Elworthy further teaches that in order to
classify documents or text according to a certain element the following Bayesian 
probabilistic formula can be used: p(l|t) = (P(t|l)*p(l))/p(t) (see column 4 lines 25-35).
If the denominator is passed to the left side the resultant equation is: p(l|t)
* p(t) = P(t|l)*p(l), where p(l|t) is the probability of the classification given the 
element and p(t) is the probability of the element (see column 4 lines 35-45). It 
would have been obvious to have used p(l|t) as disclosed in Elworthy for the 
probabilistic analysis in van den Akker as modified, where p(l|t) = P(text|cls),
where t is the language class and l is the text or set of trigrams (as disclosed by
de Campos) and p(l|t) would be the probability of the language class given the 
text or trigram. It would have been obvious to have used both prior arts because 
using the method described above would yield an accurate form of identifying the 
language model (see column 4 lines 25-45). 
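
For reference, the rearrangement of the Bayesian formula relied on above can be written out as follows, using the symbols as quoted in this action. The last line is a standard consequence added here for clarity and is not quoted from Elworthy: because the denominator p(t) does not depend on l, the value of l that maximizes p(l|t) also maximizes P(t|l)*p(l).

```latex
\begin{align*}
p(l \mid t) &= \frac{P(t \mid l)\, p(l)}{p(t)} \\
p(l \mid t)\, p(t) &= P(t \mid l)\, p(l) \\
\arg\max_{l}\, p(l \mid t) &= \arg\max_{l}\, P(t \mid l)\, p(l)
\end{align*}
```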

Conclusion 

Any inquiry concerning this communication should be directed to Josiah 
Hernandez whose telephone number is 571-270-1646. The examiner can 
normally be reached from 7:30 am to 5:00 pm.

If attempts to reach the examiner by telephone are unsuccessful, the 
examiner's supervisor, Xiao Wu, can be reached on (571) 272-7761. The fax
phone number for the organization where this application or proceeding is
assigned is 703-872-9306.

Information regarding the status of an application may be obtained from 
the Patent Application Information Retrieval (PAIR) system. Status information 
for published applications may be obtained from either Private PAIR or Public 
PAIR. Status information for unpublished applications is available through 
Private PAIR only. For more information about the PAIR system, see
http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).



J.H.